Word Check Module (Formerly SpellChk)
==================
RiscOS 3 version 0.03 © Geoff. Lane. Mar 1992
Internet: zzassgl@uts.mcc.ac.uk
Janet : zzassgl@uk.ac.mcc.uts
(The information given here may not exactly match the current state of
the module.)
Introduction
------------
This implements a word spelling check module for British/American
technical English (with a selection of Arc specific words added for
good luck) or other (unsupplied) languages. It is based on a
description in Chapter 13 of "Programming Pearls" by Jon Bentley (ISBN
0-201-10331-1) of a clever algorithm devised by Doug McIlroy in 1978.
Many RiscOS based word processors contain a spelling checker; each has
its own word list and interface. These checkers are impossible to use
outside the application. For each application there will be one or
more large dictionary files each unique to the application. If a
standard interface were created using SWIs it would be possible for
one module to provide a Word check function to all applications that
required such a facility. If an application could rely on a Word
check module being available then the application would be smaller.
The facilities provided within this module are a first attempt to
define such an interface.
This program was created and tested on a 2M A5000 machine running
RiscOS 3. The compiler used was Norcroft C version 3.0. The shared
library version used was 3.87. (I haven't included any RMEnsure
commands in the !Boot or !Run files - I'm not aware of any version
dependencies.) If you don't have the C library in ROM then you should
edit the files to check for and load the shared library.
Algorithm
---------
The algorithm allows a dictionary to be highly compressed by encoding
each word as a unique 32 bit number. The resulting list of numbers is
sorted and then a table of differences is created. This table of 16
bit numbers is included in the module. To find a word in the table
you encode the word and then see if it is possible to generate the
same value by summing the contents of the table - if you get a match
then the word is valid. (Of course you have to choose the encoding
algorithm quite carefully to ensure that the vast majority of words
translate to unique numbers (in this case only 19 words share a hash
value) and that the difference between each pair of sorted numbers is
always less than 32K.) Thus a module of about 71K bytes can be used
to check the spelling of about 32000 different words.
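
The idea can be sketched in a few lines of Python. This is only an
illustration of the difference table technique: the hash function
below is a made-up stand-in (the module's real encoding is not
described in this file), and with such a tiny word list the gaps are
far too large for 16 bits, so the sketch does not enforce that limit.

```python
def encode(word):
    """Hypothetical stand-in hash: fold a word into a 32 bit number."""
    code = 0
    for ch in word:
        code = (code * 31 + ord(ch)) & 0xFFFFFFFF
    return code


def build_table(words):
    """Return the smallest code plus the gaps between sorted codes."""
    codes = sorted({encode(w) for w in words})
    gaps = [b - a for a, b in zip(codes, codes[1:])]
    return codes[0], gaps


def contains(first, gaps, word):
    """Sum the gaps until the running total reaches or passes the code."""
    target = encode(word)
    total = first
    for gap in gaps:
        if total >= target:
            break
        total += gap
    return total == target


first, gaps = build_table(["cat", "dog", "word", "check", "syzygy"])
print(contains(first, gaps, "syzygy"))   # a word from the list
print(contains(first, gaps, "cot"))      # a word that is not in the list
```

At two bytes per difference, about 32000 words need roughly 64000
bytes of table, which is consistent with the module size quoted above.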
Building Word Lists
-------------------
The word lists were created from a number of sources...
* Take 1.5 Gbytes of Netnews.
* Take the Brown University Corpus of English Usage.
* Take the Unix online manual pages.
* Take the RiscOS help text.
Delete the non-alphabetics, sort and delete duplicates. From this you
obtain a huge list of words plus a lot of junk. You pass this list
through a standard spelling checker and then check the reject list for
words that are useful but not accepted.
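
The first, mechanical part of that recipe (delete the non-alphabetics,
sort, delete duplicates) might look like the Python sketch below. The
function name is invented for illustration and this is not the tool
that was actually used. Case is kept as found, since the module treats
capitalisation as significant. Passing the result through a spelling
checker and reviewing the rejects remains a manual job.

```python
import re


def build_word_list(text):
    """Turn raw text into a sorted list of distinct alphabetic words."""
    # Replace every run of non-alphabetic characters with a space.
    cleaned = re.sub(r"[^A-Za-z]+", " ", text)
    # Split into words, drop duplicates, then sort.
    return sorted(set(cleaned.split()))


print(build_word_list("The cat, the dog; 42 cats!"))
```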
Features
--------
* Installed as a module.
* Large dictionary but small memory requirements. (To an old BBC B
programmer the fact that I can even contemplate dedicating over 64K of
memory to a utility is slightly obscene.)
* General purpose. The spelling check is available to any program (or
module) and not restricted to a particular application.
* Multiple language support. British English and {work in progress}
American English are supplied. The language to be used in a given
run can be selected by command. The (changeable) default is British
English.
* Fairly fast. On an A5000 the current version could check about 500
words/second when running a test program within a Task Window and
about 620 wps running "native". (It is interesting to know that in the
description of the algorithm in the book mentioned above the speed is
described as about 170 wps on a VAX 11/750 - this was considered fast
at the time the book was written! The VAX version was just under 64K
in size - the dictionary was a bit smaller.)
Bugs and/or Misfeatures
-----------------------
(It's not as bad as it looks. The module is only intended to
implement basic spelling checks; clever preprocessing should be done
by the application program and not set in stone within the module.)
* Does not perform pre- or post-fix stripping.
* Can't cope with many plurals (special case of the lack of post-fix
stripping.)
* Complains about what it believes to be incorrectly capitalised
words. (Many "standard" capitalisations are encoded in the word list.)
* Does not check single character words (i,a,...)
* Currently not possible to supply a personal dictionary to be added
to the standard pre-loaded dictionary.
* The algorithm used can miss a small proportion of bad spellings.
(About 1 in 1000 misspelled words will get through.) This is a result
of the way that the words are encoded -- the error rate could be
reduced at the cost of a larger word table but then the major
advantages of having a small module size (and thus speed) are lost.
* Anagram solver will be limited and quite slow.
* Word finder is limited and quite slow. It operates by using a
brute force search using all possible words that may exist which fit
the supplied partial word. Most of the possibilities are incorrect
spellings so it pushes the checking algorithm to its limits and thus
reports more incorrect words than it should.
* Difficult to use as a spelling corrector. The module cannot suggest
close matches to a supplied word as the encoding algorithm generates
unrelated hash values for similar words; in addition the original word
list is not available to the module at run-time.
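
A rough piece of arithmetic shows why the word finder reports bogus
words. The Python sketch below assumes a 26 letter lower-case alphabet
and that nearly every brute force candidate is a misspelling; both are
simplifications. The 1 in 1000 figure is the one quoted in the list
above.

```python
ALPHABET_SIZE = 26        # assumption: only a-z are tried in each gap
ERROR_RATE = 1 / 1000     # "about 1 in 1000" bad spellings get through


def candidates(pattern):
    """Number of strings a brute force search must test for a pattern."""
    return ALPHABET_SIZE ** pattern.count("?")


def expected_false_matches(pattern):
    """Rough count of wrongly accepted candidates for a pattern."""
    return candidates(pattern) * ERROR_RATE


for pattern in ("te?t", "t??t", "???t"):
    print(pattern, candidates(pattern), expected_false_matches(pattern))
```

With three unknown characters there are 17576 candidates, so on these
assumptions a dozen or more bogus words can be expected in the
results.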
Configurable Bits
-----------------
* The default dictionary language used when the module is loaded can
be changed by altering the "WordChk$DefLang" environment variable in
both !Run and !Boot files.
* New languages can be installed by adding the encoded dictionaries to
the "Languages" sub-directory. They can then be specified as the
default language or loaded with the *WordLoad command.
Command Interface
-----------------
This allows a single word entered from the command line to be checked
against the current dictionary.
*WordCheck <word>
ok/unknown
{Work in progress} This treats the supplied word as an anagram and
tries to rearrange it into words that are found in the current
dictionary.
*WordGram <word>
This takes a word with missing characters (indicated by ?'s in the
string) and tries to find matching words in the current dictionary.
*WordFind <partial word>
This loads a new language as the current dictionary. At the moment
valid languages are "British" {work in progress, "American" and
"Technical".}
*WordLoad <language>
Program Interface
-----------------
The module provides the following SWIs...
"WordCheck_Word"
Input
R0 pointer to string to test (null byte terminated character
string as generated by BASIC V or C)
Output
R0 preserved
R1 returns boolean (-1/TRUE or 0/FALSE)
BASIC Example
SYS "WordCheck_Word","syzygy" TO ,valid%
returns valid% = -1/TRUE (honest)
whereas
SYS "WordCheck_Word","pointer" TO ,valid%
returns valid% = 0/FALSE (shame!)
"WordCheck_Find"
Input
R0 pointer to string to test with up to three '?' characters indicating
unknown characters (null byte terminated character string as
generated by BASIC V or C.) To obtain further possible matches use
"WordCheck_FindNext".
Output
R0 preserved
R1 returns first found word or null string if nothing found.
BASIC Example
SYS "WordCheck_Find","te?t" TO ,match$
returns match$ = "teat"
"WordCheck_FindNext"
Output
R1 returns next matching word or null string if nothing found.
BASIC Example
This assumes that "WordCheck_Find" has been called with an initial
partial word of "te?t".
SYS "WordCheck_FindNext" TO ,match$
returns match$ = "tent"
Further matches can be obtained by more calls to "WordCheck_FindNext"
until the end of all possible matches is indicated by the return of a null
string. For instance, in BASIC, to find all matches use code similar to...
SYS "WordCheck_Find","te?t" TO ,m$
WHILE m$ <> ""
PRINT m$
SYS "WordCheck_FindNext" TO ,m$
ENDWHILE
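
For readers without a RiscOS machine to hand, the same calling
pattern can be modelled in Python. The class below is a hypothetical
stand-in, not the module's code: it uses a plain set where the module
uses its encoded table, tries only lower-case letters, and raises an
error for more than three '?' characters (what the real module does in
that case is not documented here).

```python
from itertools import product
from string import ascii_lowercase


class WordFinder:
    """Hypothetical model of the Find/FindNext pair of SWIs."""

    def __init__(self, dictionary):
        self._dictionary = set(dictionary)
        self._matches = iter(())

    def _expand(self, pattern):
        # Try every letter in every '?' position (brute force).
        holes = [i for i, ch in enumerate(pattern) if ch == "?"]
        for letters in product(ascii_lowercase, repeat=len(holes)):
            chars = list(pattern)
            for position, letter in zip(holes, letters):
                chars[position] = letter
            word = "".join(chars)
            if word in self._dictionary:
                yield word

    def find(self, pattern):
        """Start a new search and return the first match or ""."""
        if pattern.count("?") > 3:
            # Assumption: the module's behaviour here is unknown.
            raise ValueError("at most three '?' characters are allowed")
        self._matches = self._expand(pattern)
        return self.find_next()

    def find_next(self):
        """Return the next match, or "" when there are no more."""
        return next(self._matches, "")


finder = WordFinder(["teat", "tent", "test", "text", "tilt"])
match = finder.find("te?t")
while match != "":
    print(match)
    match = finder.find_next()
```

The loop has the same shape as the BASIC one: one call to start the
search, then repeated calls until a null string comes back.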
"WordCheck_Load"
Input
R0 pointer to string holding language name to load (null byte
terminated character string as generated by BASIC V or C.) The
corresponding named language file must be present in the Languages
sub-directory within !WordChk.
Output
R0 preserved
R1 returns -1/TRUE if successful otherwise 0/FALSE.
BASIC Example
SYS "WordCheck_Load","British" TO ,ok%
returns ok% = -1/TRUE if language "British" has been loaded.
returns ok% = 0/FALSE if failed to load new language.
Building New Dictionary Files
-----------------------------
[[[ NOT IN THIS VERSION ]]]
A program, BuildDict, is provided which can create new encoded
dictionary files from word lists. These files can then be loaded into
the module. To create a new dictionary you need to do the following...
* Gather a word list. There must be at least 256 words in the list
and there will probably have to be many more words in order that
the difference between the hash values is always < 64K. There
should not be more than 33000 words in the list.
* Delete single character words and ensure that there are no
leading or trailing spaces or tabs on any word.
There should only be one word per line.
* Sort the list (not essential for BuildDict but needed for
following step.)
* Delete duplicate words.
* Place the file in the WordLists sub-directory of !WordChk.
* Run the BuildDict program. This will create, if successful, an
encoded dictionary file in the Languages sub-directory. There are
a number of possible fatal errors that may occur during
processing.
* Change the WordChk$DefLang value set in !Boot and !Run to make
your new language the default or use the *WordLoad command to
load the new language into a running module.
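
Since BuildDict is not in this version, here is a Python sketch of the
tidying and checking steps listed above, so that a word list can be
prepared in advance. It is not BuildDict itself. The function names
are invented, and the hash function has to be passed in by the caller
because the real encoding is not described in this file.

```python
MIN_WORDS = 256          # "at least 256 words"
MAX_WORDS = 33000        # "not more than 33000 words"
MAX_GAP = 64 * 1024      # hash differences must be "< 64K"


def prepare_word_list(lines):
    """Strip spaces and tabs, drop short words and duplicates, sort."""
    words = {line.strip(" \t\r\n") for line in lines}
    return sorted(w for w in words if len(w) > 1)


def check_word_list(words, encode):
    """Return a list of problems; an empty list means none were found.

    'encode' is a caller supplied hash function (hypothetical here).
    """
    problems = []
    if len(words) < MIN_WORDS:
        problems.append("too few words")
    if len(words) > MAX_WORDS:
        problems.append("too many words")
    codes = sorted({encode(w) for w in words})
    if any(b - a >= MAX_GAP for a, b in zip(codes, codes[1:])):
        problems.append("gap between hash values is 64K or more")
    return problems


print(prepare_word_list(["  cat\n", "a\n", "dog\t\n", "cat\n", "\n"]))
```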
The hash algorithm has been optimised for UK/US English. It may not be
suitable for other languages. A future version of !WordChk may
include a means to alter and re-optimise the hash algorithm if
necessary for each language to be loaded.
Foreign Languages
-----------------
True spelling checkers for foreign languages are complicated by the
fact that most of them care about the 'sex' of the words. Some of
them are so regular that native writers rarely make spelling errors
other than simple 'typing' errors, i.e. transposition of characters.
Some languages insist on strange characters that do not appear on the
keyboard. The Arc copes quite well with the strange characters for
languages such as French and German. Languages such as Esperanto are
not so well provided for as the accents appear on unexpected
characters and special provision would have to be given to defining
them.
In any case WordChk is just a word checker and not a full spelling
checker (the difference is that one just tries to match a word to one
in a list, the other attempts to manipulate the word in various ways
to attempt to find the root.)
British     {word list being repaired}
American    {word list being repaired}
Computing   {work in progress}
Hacking     {work in progress}
French      need smaller word list!
German      need word list.
Italian     need word list.
Esperanto   can't display accented characters from default font.
Latin       need word list (getting a bit weird here?)
=====================================================================
The legal bit: I don't care what kind of ...ware it is called but I
retain copyright on the code and encoded dictionary table used in this
particular RiscOS implementation of the spelling checker algorithm.
You can distribute version 0.03 of the WordCheck module and
associated files as far and wide as you wish so long as this README
file is also distributed with the module and hash file. You may
include the !WordChk application (or just the language, Wordchk
module and README files) within another application that makes use of
its facilities. If you paid money (other than a small amount for disc
duplication and postage) for these files then you've been ripped off.
As noted above, this code is in alpha test and you take your own
chances with bugs, spelling errors etc.
======================================================================